Review

  • Tools
    • Ch13. Linear Factor Models
    • Ch14. Autoencoders
  • Overviews(?)
    • Ch15. Representation Learning
  • Specific Issues
    • Ch16. Structured Probabilistic Models for Deep Learning
    • Ch17. Monte Carlo Methods
    • Ch18. Confronting the Partition Function
    • Ch19. Approximate Inference
    • Ch20. Deep Generative Models

Contents

  • Introduction
    • Definition of Representation
  • Greedy Layer-Wise Unsupervised Pretraining
    • When and Why Does Unsupervised Pretraining Work?
  • Transfer Learning and Domain Adaptation
    • Use shared representation
  • Semi-Supervised Disentangling of Causal Factors
    • Use information from unsupervised tasks to perform supervised task
  • Distributed Representation
  • Exponential Gains from Depth
    • Deep representation
  • Providing Clues to Discover Underlying Causes

Introduction

Representation

  • Arabic numeral representation VS Roman numeral representation
    • 210 / 6 VS CCX / VI
  • Better representation in Machine Learning
    • Good one makes a subsequent task easier
  • Almost all learning algorithms learn "representations" in a deep architecture
    • Supervised/Unsupervised Learning learns "implicitly" as side effects
    • Some algorithms designed explicitly for Representation Learning
      • e.g. Distribution Learning (Density Estimation)
  • Tradeoff Issue
    • Preserving much information VS Nice properties (e.g. Independence)

Use Unlabeled Data for a good representation

  • Unsupervised Learning
  • Semi-supervised learning

15.1 Greedy Layer-Wise Unsupervised Pretraining

Reference image: https://wikidocs.net/images/page/3413/glw.png (source: https://wikidocs.net/3413)

  • Greedy Layer-Wise?

    • Optimizes one layer at a time rather than jointly optimizing all pieces
  • Use single-layer representation learning algorithm

    • RBM, single-layer autoencoder, sparse coding model (Ch13/14)
    • Take the output of the previous layer
    • Produce a new simpler representation

  • Good initialization for a joint learning procedure over all the layers of a deep neural net for a supervised task (a minimal sketch follows this list)
  • Used to successfully train "even" fully connected architectures
  • Fine tuning after pretraining
    • Optimizes all layers together
    • Can be done in the pretraining phase (pretraining & fine-tuning simultaneously)
  • Can be viewed as a regularizer in supervised learning task
  • Overall training scheme is nearly the same
    • learning algorithms, model types can differ
  • Initialization for unsupervised learning algorithms for...
    • Deep autoencoders
    • Probabilistic models with many layers of latent variables
    • Deep Generative Models (Ch20)
      • Deep belief networks
      • Deep Boltzmann machines
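
A minimal sketch of the greedy layer-wise procedure above, assuming PyTorch, single-layer autoencoders as the per-layer learner, and made-up layer sizes (the chapter's procedure equally applies to RBMs or sparse coding as the single-layer algorithm):

```python
import torch
import torch.nn as nn

layer_dims = [784, 256, 64]            # hypothetical input and hidden sizes
x = torch.randn(128, layer_dims[0])    # stand-in for unlabeled training data

encoders, h = [], x
for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]):
    enc = nn.Linear(d_in, d_out)
    dec = nn.Linear(d_out, d_in)       # decoder is only used during pretraining
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):               # greedy step: optimize this layer alone
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(torch.relu(enc(h))), h)
        loss.backward()
        opt.step()
    encoders.append(enc)
    h = torch.relu(enc(h)).detach()    # simpler representation fed to the next layer

# Fine-tuning: stack the pretrained encoders, add a supervised head, train jointly
model = nn.Sequential(*(nn.Sequential(e, nn.ReLU()) for e in encoders),
                      nn.Linear(layer_dims[-1], 10))
```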

15.1.1 When and Why Does Unsupervised Pretraining Work?

  • History
    • Substantial improvements in test error for "Classification Tasks"
      • Revival of deep neural networks (2006, Hinton)
    • Harmful on many other tasks
    • Ma, J. (2015, Deep neural nets as a method for quantitative structure) found...
      • Significantly helpful for many tasks
      • Slightly harmful on average
    • So we should know "When and Why pretraining works" for a particular task
  • 2 Intuitions

    • Act as regularizer
      • e.g. Optimize only the higher layers (classifier) while freezing the lower layers (feature extractor)
      • Prevent overfitting
      • Improve test set error
      • Speed up optimization
    • Some features that are useful for the unsupervised task may also be useful for the supervised learning task
      • After extracting wheels, we can classify cars and motorcycles by counting wheels
  • Expected Values

    • More effective when the initial representation is poor
      • dimension reduction + manifold learning (Ch14)
      • e.g. word embeddings: learned embeddings give a meaningful similarity between words, unlike one-hot codes
    • Use unlabeled data when labeled data is very scarce (Semi-supervised learning)
    • Regularization for complicated functions
  • Why it works
    • Reduces the variance of the estimation process
      • Figure 15.1 explanation
        • Input-output projection for visualization
        • various starting points (initializations)
        • blue -> red: timeline, from the origin outward
        • trajectories with pretraining converge to a smaller region

  • Comparison to other ways
    • Two "separate" phases
      • More hyperparameters => time-consuming
    • => one phase pretraining
      • Unsupervised learning and supervised learning simultaneously
      • Attach unsupervised learning term to objective function
  • Two phase VS one phase

    • many hyperparameters vs a single hyperparameter
    • several trial-and-error iterations vs one shot
    • no direct control over the regularization strength vs control via the coefficient on the unsupervised cost term (see the sketch after this list)
  • The popularity of unsupervised pretraining has declined

    • Still popular in NLP(natural language processing)
    • Networks regularized with dropout or batch normalization for classification
      • outperform pretrained versions even on medium-sized datasets
    • Bayesian methods outperform on small datasets
  • Nevertheless unsupervised pretraining...
    • an important milestone in the history of deep learning research
    • continues to influence contemporary approaches
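
A minimal sketch of the one-phase alternative mentioned above: the unsupervised (here, reconstruction) cost is added to the supervised objective and its strength is controlled by a single coefficient. PyTorch, the sizes, and the name lambda_unsup are illustrative assumptions, not the book's exact recipe.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)      # unsupervised (reconstruction) head
classifier = nn.Linear(64, 10)    # supervised head
params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(classifier.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

lambda_unsup = 0.5                # single knob controlling regularization strength
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))

h = encoder(x)
loss = nn.functional.cross_entropy(classifier(h), y) \
     + lambda_unsup * nn.functional.mse_loss(decoder(h), x)
opt.zero_grad()
loss.backward()
opt.step()
```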

15.2 Transfer Learning and Domain Adaptation

  • One example problem of transfer learning

    • How to reuse a feature extractor trained on Zebra vs Horse for the classification of Dalmatian vs Dog (a minimal sketch follows this list)
  • In transfer learning, the learner must perform two or more different tasks

    • e.g. Learn on significantly more data (P1), apply the learned transformation on P2(Small data)
  • Sharing layers

    • Share lower layers (underlying factors appear in low-level features) => Multi-task learning
      • e.g. visual categories
        • low-level notions of "edges" and "visual shapes" (corners, circles)
    • Share higher layers (e.g. speech recognition) => Domain Adaptation
  • Domain Adaptation (Sharing Higher Layer)

    • Same task -> Different distribution $P$
    • e.g. Learning positive/Negative sentiment
      • Task1: about Music, Task2: about Movies
      • Why?: vocabulary and style vary from one domain to another
  • Concept Drift
    • Gradual changes in the data distribution over time
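
A minimal sketch of sharing lower layers for transfer, assuming PyTorch and made-up sizes: a feature extractor pretrained on the data-rich task P1 is frozen, and only a new output layer is trained on the small task P2.

```python
import torch
import torch.nn as nn

# Lower layers: generic factors (edges, shapes); assume they were trained on P1
feature_extractor = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                                  nn.Linear(256, 64), nn.ReLU())

for p in feature_extractor.parameters():
    p.requires_grad = False          # reuse the shared representation as-is

new_head = nn.Linear(64, 2)          # task-specific layer for the small task P2
opt = torch.optim.Adam(new_head.parameters(), lr=1e-3)

x_small, y_small = torch.randn(16, 784), torch.randint(0, 2, (16,))
logits = new_head(feature_extractor(x_small))
loss = nn.functional.cross_entropy(logits, y_small)
opt.zero_grad()
loss.backward()
opt.step()
```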

While the phrase "multi-task learning" typically refers to supervised learning tasks, the more general notion of transfer learning is applicable to unsupervised learning and reinforcement learning as well.

  • Same representation may be useful in both settings

    • e.g. Transfer learning competition
      • Mesnil, G. et al. (2011), Unsupervised and transfer learning challenge: a deep learning approach
      • 1st: Learn on $P_1$
      • 2nd: Apply the learned transformation to $P_2$
      • Result
        • deeper representations => faster learning $P_2$
  • Two examples: One-shot learning and zero-shot(zero-data) learning

    • Extreme forms of transfer learning
    • One-shot: One example in the 2nd stage
      • e.g.
        • learn "wheels" from images of bikes n cars
        • learn the one image of a 3-wheel bike
        • test on images of 3-wheel bikes
    • Zero-shot
      • Testing in the 2nd stage without any labeled examples of the new task
      • Learn 2 representations and their relation
      • e.g. Text-Image learning
        • Link text space("4 Legs") - Image space(visual shape of legs and their count)
        • Learn Birds ("2 Legs", "No Ears"), Dogs ("4 Legs", "Round Ears")
        • Input: Text about Cats (4 Legs, Pointy ears)
        • Apply to the images of Cats
      • e.g. Machine translation
        • We can translate sentences even when some words have no labeled translation
        • If word X in language A and word Y in language B behave similarly in their respective spaces => likely the same meaning

  • Zero-shot Model

    • $P(y| x, T)$

      • Traditional input $x$
      • Traditional output $y$
      • Additional random variables, Task $T$
      • e.g. $x$ is an image, $y$ is "yes" or "no", $T$ is the question "Is there a cat in this image?"

      If we have a training set containing unsupervised examples of objects that live in the same space as T , we may be able to infer the meaning of unseen instances of T.

      • $T$ should be represented in a way that allows some sort of generalization (sketch below)
        • e.g. generalizing to "Is there an animal in this image?"

15.3 Semi-Supervised Disentangling of Causal Factors

  • Large amount of unlabeled data and relatively little labeled data

  • $P(x)$ is helpful for $P(y|x)$
  • Causal Factor -(Representation)-> Feature

  • Better Representations?

    1. Representation disentangles the causes from one another
    2. Easy to model
      • e.g. Simple model: sparsity, independence
  • Hypothesis motivating semi-supervised learning

    • If (1) and (2) coincide =>
    • If a representation $h$ captures many of the underlying causes of the observed $x$,
      • and the outputs $y$ are among the most "salient" causes, then it is easy to predict $y$ from $h$
      • Model $p(x)$ through $p(x|h)$ and $p(h)$; the uncovered $h$ also makes $p(y|x)$ easy
    • c.f. If $P(x)$ is uniformly distributed => Semi-supervised learning fails
    • Simple example: if $p(x)$ is a mixture with one well-separated component per class $y$, modeling $p(x)$ reveals the components and makes $p(y|x)$ nearly trivial (see the sketch after this list)

  • Issue: Hard to capture the salient factors
    • Two strategies
      1. Use a supervised learning signal (labeled data)
      2. Use a much larger representation
  • Adversarial Framework (CH 20)
    • Modify the definition of which underlying causes are most salient.
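
A minimal sketch of the mixture example above, assuming scikit-learn and synthetic data: model $p(x)$ on unlabeled points, then use a handful of labels to name the discovered components, which makes $p(y|x)$ easy.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled x drawn from two well-separated causes (classes)
x_unlabeled = np.vstack([rng.normal(-3, 1, (500, 2)), rng.normal(+3, 1, (500, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(x_unlabeled)

# A few labeled points are enough to attach a class y to each component
x_labeled = np.array([[-3.0, -3.0], [3.0, 3.0]])
y_labeled = np.array([0, 1])
component_to_class = {c: y for c, y in zip(gmm.predict(x_labeled), y_labeled)}

# p(y|x) is now easy: predict the component, then map it to a class
x_test = np.array([[2.5, 3.5]])
print(component_to_class[gmm.predict(x_test)[0]])   # -> 1
```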

15.4 Distributed Representation

  • Symbolic Representation
    • One distinct symbol or category per concept (n concepts -> n symbols or categories)
    • e.g.
      • Red Car, Green Car, Blue Car, Red Truck, Green Truck, Blue Truck, Red Bird, Green Bird, Blue Bird
    • N "Binary" features => One hot representation
      • e.g.
        • Red Car = [1, 0, 0, 0, 0, 0, 0, 0, 0]
        • Blue Bird = [0, 0, 0, 0, 0, 0, 0, 0, 1]
      • Still "Sparse" Representation (CH 1)
  • Distributed Representation?

    • e.g.
      • Red Car = [[1, 0, 0], [1, 0, 0]] ([[Red Bit, Green Bit, Blue Bit],[Car Bit, Truck Bit, Bird Bit]])
      • Blue Bird = [[0, 0, 1], [0, 0, 1]]
    • Not all values are feasible
      • e.g. [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0] are not feasible
  • One-hot Representation VS Distributed Representation

    • Only one entry can be active VS multiple entries can be active
    • Representation Dimension
      • $n^d$ VS $n\times d$
        • $d:= \text{number of features}$, $n:= \text{number of values per feature}$
        • e.g. $3^2 = 9$ VS $2\times3 = 6$
      • Not Powerful VS Powerful
  • The combination of a powerful representation layer and a weak classifier layer can be a strong regularizer
    • A classifier trying to learn the concept of "person" vs "not a person" does not need to assign a different class to an input represented as "woman with glasses".
      • (person vs not a person), (man vs woman), (with glasses vs without glasses)
    • This capacity constraint encourages each classifier to focus on few $h_i$ and encourages $\mathbf{h}$ to learn to represent the classes in a linearly separable way
      • some classifier focuses (man vs woman), another one focuses (with glasses vs without glasses)
  • Non-distributed Representation
    • Example Type 1: Input point is assigned to exactly one cluster.
      • Clustering methods: K-means algorithm
      • Decision Trees
    • Example Type 2: Entries cannot be controlled separately from each other.
      • K-nearest neighbors algorithms
      • Gaussian Mixtures and Mixtures of Experts
      • Kernel machines with a Gaussian kernel
    • Another example: N-gram language models
      • The set of contexts is partitioned by a tree of suffixes (Ch 12)
  • Distributed Representation
    • Generalization arises due to "shared attributes"
      • "cat" and "dog"
        • "has_fur" or "number_of_legs" have the same value in the embeddings of both
    • Induces a rich "similarity space"
      • Semantically close concepts are close in "Distance"
        • "cat" is closer to "dog" than "snake"
  • When and Why can there be a statistical advantage from using a distributed representation as part of a learning algorithm?

    • When
      • complicated structure can be compactly represented using a small number of parameters (dim vector size)
      • Bigger Dim of parameters -> larger degree of freedom -> larger regions -> larger data
    • Why

      • The Number of Distinguishable Regions using linear threshold units
      • $d:= \text{input dimension}$, $n:= \text{number of features (linear threshold units)}$
        • $\sum_{j=0}^{d} \binom{n}{j} = O(n^{d})$ distinguishable regions using only $O(nd)$ parameters (a small numeric check follows this list)
      • Can be extended to the case using nonlinear units
        • represents more regions with fewer parameters
        • fewer examples needed to generalize well
      • The effective capacity remains limited
        • If $w$ is the number of weights, the VC dimension is $O(w \log w)$
          • VC Dimension (Vapnik-Chervonenkis)?
            • e.g. the VC dimension of a linear classifier in 2D is 3
      • Learning about each of them without having to see all the configurations of all the others
        • e.g. If we learn about man with glasses, man without glasses and woman without glasses
          • we can infer woman with glasses.
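
A small numeric check of the counting argument above, with made-up values of n and d: n linear threshold features over a d-dimensional input distinguish $\sum_{j=0}^{d}\binom{n}{j} = O(n^d)$ regions while using only $O(nd)$ parameters.

```python
# Regions induced by n hyperplanes in general position in R^d (Zaslavsky-style count)
from math import comb

def num_regions(n_features: int, d_input: int) -> int:
    """Number of regions n linear threshold features can distinguish in R^d."""
    return sum(comb(n_features, j) for j in range(d_input + 1))

for n, d in [(5, 2), (10, 2), (10, 3), (20, 3)]:
    print(f"n={n:2d}, d={d}: {num_regions(n, d):5d} regions "
          f"from ~{n * d} parameters")
```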

15.5 Exponential Gains from Depth

  • Functions can be represented by exponentially smaller deep networks compared to shallow networks

    • Small deep networks can represent some functions more compactly than shallow networks
  • e.g. Generative Model

    • Needs to map from the underlying factors to the input in highly nonlinear ways in order to generate data
    • High nonlinearity
      • Composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a Distributed Representation
      • Deep Network + Distributed Representation => High nonlinearity
      • e.g. Simple Universal Approximator (Ch 6)
        • Boolean gates, sums/products, or RBF units, even with a single hidden layer
        • Can approximate a large class of functions
        • Expressive Power
          • Need "exponential" number of hidden units in order to have same expressive power of architecture with additional 1 depth.
      • Similar Result on...
        • Deterministic feedforward networks as universal approximators of "probability distributions"
          • Many structured probabilistic models with a single hidden layer of latent variables (Ch 16)
          • e.g. Boltzmann machines, deep belief networks
          • A deeper one can have an "exponential" advantage over a shallow one
        • sum-product network for probabilistic models (SPN)
        • Deep circuits related to convolutional networks (Convolutional sum-product network)
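
A small, loosely analogous illustration (my own example, not the book's proofs): the parity of n bits is compact for a deep circuit that reuses intermediate XOR features, while a shallow two-layer OR-of-ANDs (DNF) circuit must enumerate one AND term per odd-parity input pattern, i.e. exponentially many.

```python
from functools import reduce
from itertools import product
from operator import xor

def deep_parity(bits):
    """Chain/tree of n-1 XOR gates: reuses intermediate features."""
    return reduce(xor, bits)

def shallow_parity_terms(n):
    """Minterms a two-layer OR-of-ANDs circuit must enumerate: 2**(n-1) of them."""
    return [p for p in product([0, 1], repeat=n) if sum(p) % 2 == 1]

print("deep_parity([1, 0, 1, 1]) =", deep_parity([1, 0, 1, 1]))
for n in (4, 8, 12):
    print(f"n={n:2d}: deep circuit ~{n - 1} XOR gates, "
          f"shallow DNF needs {len(shallow_parity_terms(n))} AND terms")
```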

15.6 Providing Clues to Discover Underlying Causes

  • What makes one representation better than another?

    • One that disentangles the underlying causal factors
    • Helps the learner separate the relevant observed factors from the others
    • Introduce clues that help the learner find these underlying factors and disentangle them from the others
    • Type 1: Supervised learning
      • Provides a very strong clue: a label $\mathbf{y}$
    • Type 2: Use of abundant unlabeled data
      • hints about the underlying factors
        • take the form of implicit prior beliefs that the designers of the learning algorithm impose in order to guide the learner.
        • Regularization strategies are necessary to obtain good generalization
        • One goal of deep learning is to find a set of fairly generic regularization strategies.
  • Generic Regularization Strategies

    • Smoothness
      • $f(x + \epsilon d) \approx f(x)$ for small $\epsilon$ and unit vector $d$
      • allow to generalize from training examples to nearby points in input space
    • Linearity
      • Relationships btw some variables are linear.
      • Makes predictions possible even very far from the observed data,
        • but sometimes leads to overly extreme predictions
      • Simple machine learning algorithms use "Linearity" instead of "Smoothness"
        • Linearity and Smoothness are different assumptions in "high-dimensional" space
    • Multiple Explanatory Factors (Output)
      • Motivates semi-supervised learning, where modeling $p(x)$ helps with $p(y|x)$
      • Distributed Representation
    • Causal Factors (Input)
      • Underlying Causal Factor $h$ in Semi-Supervised Learning
    • Depth or a Hierarchical Organization of Explanatory Factors
      • High level can be defined in terms of simple concepts forming a hierarchy.
        • Cat (High level), pointy ears, 4 legs (Lower level)
      • Multi step program
        • Each step (layer) refers back to the output of the previous step (layer)
    • Shared Factors across Tasks
      • In many tasks: different $\mathbf{y_i}$ outputs share the same $\mathbf{x}$ input.
        • There are task-specific functions $f^{(i)}(\mathbf{x})$ of a global input $\mathbf{x}$
        • Each $\mathbf{y_i}$ is associated with a different subset of $\mathbf{h}$
          • $P(\mathbf{y_i} | \mathbf{x})$ depends on the shared $P(\mathbf{h} | \mathbf{x})$
    • Manifolds
      • Regions in which probability mass concentrates are...
        • locally connected
        • occupy a tiny volume
      • These regions can be approximated by low-dimensional manifolds with much smaller dimensionality than the input space
      • Motivate Autoencoders
    • Natural Clustering
      • Each manifold in the input space may be assigned to a single class.
      • The data may lie on many disconnected manifolds
      • Motivate tangent propagation, double backprop, manifold tangent classifier, adversarial training
    • Temporal and Spatial Coherence
      • The most important explanatory factors change slowly over time (a minimal sketch follows this list)
    • Sparsity
      • Most features should presumably not be relevant to describing most inputs
    • Simplicity of Factor Dependencies
      • The simplest example
        • $P(\mathbf{h}) = \prod_i P(h_i)$
      • Linear dependencies, or dependencies captured by an autoencoder
      • This assumption appears in many laws of physics
      • Motivates using a linear predictor or a factorized prior on top of a learned representation
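
A minimal sketch of the temporal coherence (slowness) clue, assuming PyTorch and a made-up frame sequence: penalize the representation for changing quickly between consecutive frames. In practice this term is combined with other objectives (e.g. reconstruction or variance constraints) so the representation does not collapse to a constant.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

frames = torch.randn(10, 784)              # stand-in for a short video clip
h = encoder(frames)                        # h[t] is the representation of frame t
slowness = (h[1:] - h[:-1]).pow(2).mean()  # features should change slowly over time
opt.zero_grad()
slowness.backward()                        # used alone this collapses; combine with other losses
opt.step()
```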